{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"# Introduction to data manipulation and analysis with Pandas\n",
"\n",
"## Try me\n",
" \n",
" [](https://colab.research.google.com/github/ffraile/operations-research-notebooks/blob/main/docs/source/Introduction/libraries/Pandas%20tutorial.ipynb)[](https://mybinder.org/v2/gh/ffraile/operations-research-notebooks/main?labpath=docs%2Fsource%2FIntroduction%2Flibraries%2FPandas%20tutorial.ipynb)\n",
"\n",
"[Pandas](https://pandas.pydata.org/) is a powerful library that provides convenient data structures and functions to work with data. \n",
"Throughout the course, we will use Pandas for data manipulation and analysis.\n",
"To use Pandas in your project, you first need to install it in your environment. Additionally, in this tutorial we will import the display and Markdown libraries to display the dataframes as tables."
]
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"!pip install pandas\n",
"!pip install ipython"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 1,
"outputs": [],
"source": [
"# Import pandas\n",
"import pandas as pd\n",
"\n",
"# Import display and Markdown\n",
"from IPython.display import display, Markdown\n"
],
"metadata": {
"collapsed": false,
"pycharm": {
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"## Introduction to Pandas data structures\n",
"This introduction covers the fundamentals of Pandas data structures used in this course. It is however recommended that you also take a look to the [official documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#dsintro).\n",
"\n",
"There are primarily two types of Pandas data structures:\n",
"\n",
"- **Dataframes**: Dataframes are two-dimensional data structures, similar to a two-dimensional numpy array, a spreadsheet or an SQL table.\n",
"- **Series**: Series are one-dimensional data structures, similar to one-dimensional numpy arrays \n",
"\n",
"### Dataframes\n",
"Data frames facilitates the manipulation and analysis of two-dimensional data structures, combining convenient functions to:\n",
"\n",
"- **Extract** data from different sources, like CSV files, Excel sheets, or SQL databases\n",
"- **Transform** data, applying functions to change the values, or combining different data sources to prepare the data for our application\n",
"- **Load** data into output CSV files, Excel sheets or SQL databases\n",
"- **Analyse** Analyse the data \n",
"\n",
"Pandas columns in a dataframe are not necessarily of the same type. In the course, we will primarily work with numeric types, string types and the datetime type, so we will focus on those two types in this tutorial. We will also store complex objects, and find convenient Pandas functions to manipulate the dataframe using them, so you can see that Pandas is going to be very useful throughout the course!\n",
"\n",
"You can create a dataframe from different structures:\n",
"DataFrame accepts many kinds. Among others, in this course, we will primarily use:\n",
"\n",
"- A dictionary of one dimensional numpy arrays, lists, dicts, or Series\n",
"- CSV, or Excel files\n",
"- Another Pandas dataframe\n",
"- A Pandas series\n",
"\n",
"Let us start with a simple example, using a dictionary."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 2,
"outputs": [
{
"data": {
"text/plain": " one two three date\n0 1 5 Awesome 2021-01-01\n1 2 4 Brutal 2021-01-02\n2 3 3 Cool 2021-01-02\n3 4 2 Divine 2021-02-03\n4 5 1 Exquisite 2021-02-05",
"text/html": "
\n\n
\n \n \n | \n one | \n two | \n three | \n date | \n
\n \n \n \n | 0 | \n 1 | \n 5 | \n Awesome | \n 2021-01-01 | \n
\n \n | 1 | \n 2 | \n 4 | \n Brutal | \n 2021-01-02 | \n
\n \n | 2 | \n 3 | \n 3 | \n Cool | \n 2021-01-02 | \n
\n \n | 3 | \n 4 | \n 2 | \n Divine | \n 2021-02-03 | \n
\n \n | 4 | \n 5 | \n 1 | \n Exquisite | \n 2021-02-05 | \n
\n \n
\n
"
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# This is a dictionary\n",
"dict = {\"one\": [1, 2, 3, 4, 5], \"two\": [5, 4, 3, 2, 1], \"three\":[\"Awesome\", \"Brutal\", \"Cool\", \"Divine\", \"Exquisite\"], \"date\": ['2021-01-01', '2021-01-02', '2021-01-02', '2021-02-03', '2021-02-05'] }\n",
"\n",
"# And this is a dataframe\n",
"dataframe = pd.DataFrame.from_dict(dict)\n",
"\n",
"display(dataframe)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"Our Pandas dataframe, Pandas, has three columns, with names, 'one', 'two', and 'three'. Notice the integer number in the column to the left. This is the **index**, and by default, the index has no name. \n",
"\n",
"Another convenient way to create a dataframe is to use a set of records in an array:"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 3,
"outputs": [
{
"data": {
"text/plain": " one two three date\n0 1 5 Awesome 2021-01-01\n1 2 4 Brutal 2021-01-02\n2 3 3 Cool 2021-01-02\n3 4 2 Divine 2021-02-03\n4 5 1 Exquisite 2021-02-05",
"text/html": "\n\n
\n \n \n | \n one | \n two | \n three | \n date | \n
\n \n \n \n | 0 | \n 1 | \n 5 | \n Awesome | \n 2021-01-01 | \n
\n \n | 1 | \n 2 | \n 4 | \n Brutal | \n 2021-01-02 | \n
\n \n | 2 | \n 3 | \n 3 | \n Cool | \n 2021-01-02 | \n
\n \n | 3 | \n 4 | \n 2 | \n Divine | \n 2021-02-03 | \n
\n \n | 4 | \n 5 | \n 1 | \n Exquisite | \n 2021-02-05 | \n
\n \n
\n
"
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# This is an array of dict\n",
"records = [{\"one\": 1, \"two\": 5, \"three\":\"Awesome\", \"date\":\"2021-01-01\"},\n",
" {\"one\": 2, \"two\": 4, \"three\":\"Brutal\", \"date\":\"2021-01-02\"},\n",
" {\"one\": 3, \"two\": 3, \"three\":\"Cool\", \"date\":\"2021-01-03\"},\n",
" {\"one\": 4, \"two\": 2, \"three\":\"Divine\", \"date\":\"2021-02-04\"},\n",
" {\"one\": 5, \"two\": 1, \"three\":\"Exquisite\", \"date\":\"2021-02-05\"}\n",
" ]\n",
"\n",
"# This is a dataframe\n",
"dataframe_2 = pd.DataFrame.from_records(records)\n",
"display(dataframe)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"Another convenient function to create a dataset is loading a csv file:\n",
"\n",
"``` python\n",
"online_retail = pd.read_csv('datasets/Online Retail_2.csv', usecols=['InvoiceNo', 'StockCode', 'Quantity','InvoiceDate')\n",
"\n",
"```\n",
"This line of code will create a dataset from the file 'datasets/Online Retail_2.csv', loading only data from the columns labeled as 'InvoiceNo', 'StockCode', 'Quantity', and 'InvoiceDate'.\n",
"\n",
"#### Accessing data\n",
"There are multiple ways to access the data in a dataframe. Dataframe columns are **series**, the other important data structure in Pandas. You can access columns in a dataframe using column names as keys:"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 4,
"outputs": [
{
"data": {
"text/plain": "0 Awesome\n1 Brutal\n2 Cool\n3 Divine\n4 Exquisite\nName: three, dtype: object"
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"names = dataframe['three']\n",
"display(names)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"Notice that series are also indexed, you can access *elements* or *slices* just as in numpy:\n",
" "
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"source": [
"print(names[0])\n",
"print(names[1])\n",
"print(names[0:2])\n"
],
"metadata": {
"collapsed": false,
"pycharm": {
"is_executing": false
}
},
"execution_count": 5,
"outputs": [
{
"name": "stdout",
"text": [
"Awesome\n",
"Brutal\n",
"0 Awesome\n",
"1 Brutal\n",
"Name: three, dtype: object\n"
],
"output_type": "stream"
}
]
},
{
"cell_type": "markdown",
"source": [
"Obviously, you can access the elements directly from the data frame:"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"source": [
"print(dataframe['three'][0])"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% \n",
"is_executing": false
}
},
"execution_count": 6,
"outputs": [
{
"name": "stdout",
"text": [
"Awesome\n"
],
"output_type": "stream"
}
]
},
{
"cell_type": "markdown",
"source": [
"#### Queries\n",
"You can use queries to access parts of your dataframe, that is, to *filter* the data, using python expressions. \n",
"The function *query()* is used to issue the query. For instance, the following query finds all rows where the column 'one' is less or equal than two:\n",
" "
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 7,
"outputs": [
{
"data": {
"text/plain": " one two three date\n0 1 5 Awesome 2021-01-01\n1 2 4 Brutal 2021-01-02",
"text/html": "\n\n
\n \n \n | \n one | \n two | \n three | \n date | \n
\n \n \n \n | 0 | \n 1 | \n 5 | \n Awesome | \n 2021-01-01 | \n
\n \n | 1 | \n 2 | \n 4 | \n Brutal | \n 2021-01-02 | \n
\n \n
\n
"
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"filtered_df = dataframe.query('one <= 2')\n",
"display(filtered_df)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"You can combine different columns in the expressions, for instance, the following example filters data where the column 'two' is greater than 3 and the column 'three' is equal to \"Awesome\". The second example shows values where the column 'one' is equal to the column 'two'"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 8,
"outputs": [
{
"data": {
"text/plain": " one two three date\n0 1 5 Awesome 2021-01-01",
"text/html": "\n\n
\n \n \n | \n one | \n two | \n three | \n date | \n
\n \n \n \n | 0 | \n 1 | \n 5 | \n Awesome | \n 2021-01-01 | \n
\n \n
\n
"
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": " one two three date\n2 3 3 Cool 2021-01-02",
"text/html": "\n\n
\n \n \n | \n one | \n two | \n three | \n date | \n
\n \n \n \n | 2 | \n 3 | \n 3 | \n Cool | \n 2021-01-02 | \n
\n \n
\n
"
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"filtered_df2 = dataframe.query('two > 3 and three == \"Awesome\"')\n",
"display(filtered_df2)\n",
"filtered_df3 = dataframe.query('one == two')\n",
"display(filtered_df3)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
" \n",
"#### Vectorized functions\n",
"Pandas series have a huge set of vectorized functions you can apply to the entire series. For instance, the different functions in the numpy tutorial:\n",
"\n"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 9,
"outputs": [
{
"data": {
"text/plain": "15"
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": "0 1\n1 3\n2 6\n3 10\n4 15\nName: one, dtype: int64"
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": "3.0"
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display(dataframe['two'].sum())\n",
"display(dataframe['one'].cumsum())\n",
"display(dataframe['two'].mean())"
],
"metadata": {
"collapsed": false,
"pycharm": {
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"One interesting function is ```describe()```, which returns another dataframe with statistical information about the series or dataframe:"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 10,
"outputs": [
{
"data": {
"text/plain": " one two\ncount 5.000000 5.000000\nmean 3.000000 3.000000\nstd 1.581139 1.581139\nmin 1.000000 1.000000\n25% 2.000000 2.000000\n50% 3.000000 3.000000\n75% 4.000000 4.000000\nmax 5.000000 5.000000",
"text/html": "\n\n
\n \n \n | \n one | \n two | \n
\n \n \n \n | count | \n 5.000000 | \n 5.000000 | \n
\n \n | mean | \n 3.000000 | \n 3.000000 | \n
\n \n | std | \n 1.581139 | \n 1.581139 | \n
\n \n | min | \n 1.000000 | \n 1.000000 | \n
\n \n | 25% | \n 2.000000 | \n 2.000000 | \n
\n \n | 50% | \n 3.000000 | \n 3.000000 | \n
\n \n | 75% | \n 4.000000 | \n 4.000000 | \n
\n \n | max | \n 5.000000 | \n 5.000000 | \n
\n \n
\n
"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 10
}
],
"source": [
"dataframe.describe()\n"
],
"metadata": {
"collapsed": false,
"pycharm": {
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"Another important vectorized function is unique(). We will use unique()"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"All string type functions are vectorised and can be accessed through *str*. For instance, the following example gets the first character of the string:"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 11,
"outputs": [
{
"data": {
"text/plain": "0 AWESOME\n1 BRUTAL\n2 COOL\n3 DIVINE\n4 EXQUISITE\nName: three, dtype: object"
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"upper_three = dataframe['three'].str.upper()\n",
"display(upper_three)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"*str* provides vectorized access to the string value of the series, and thus you can access parts of every string in a series using indexes:"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 12,
"outputs": [
{
"data": {
"text/plain": "0 A\n1 B\n2 C\n3 D\n4 E\nName: three, dtype: object"
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"first_character = dataframe['three'].str[0]\n",
"display(first_character)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"Similar to str, *dt* implements vectorized functions to work with dates.The following examples first transforms column \"date\" into a datetime, and then calculates the month."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 13,
"outputs": [
{
"data": {
"text/plain": "0 2021-01\n1 2021-01\n2 2021-01\n3 2021-02\n4 2021-02\nName: date, dtype: period[M]"
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"month = pd.to_datetime(dataframe['date']).dt.to_period(\"M\")\n",
"display(month)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"The function pd.to_datetime converts the series in column 'date' into a datetime. Then, dt.to_period() transforms this column into a period, in this case, the letter \"M\" represents *months*, and therefore, the period obtained is months.\n",
"\n",
"Finally, *apply* allows us to apply and *lambda* function taking all the values of the dataframe or just a series:"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 14,
"outputs": [
{
"data": {
"text/plain": "0 2\n1 3\n2 4\n3 5\n4 6\nName: one, dtype: int64"
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plus_one = dataframe['one'].apply(lambda x: x +1)\n",
"display(plus_one)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"In the code above, we apply a newly created lambda function that takes one parameter x and just adds one. apply() will apply this function to the series in column 'one'.\n",
"There is a huge set of functions available, and many things you can do which are not covered in this tutorial, so you better check the documentation at\n",
"[Pandas Series API Reference](https://pandas.pydata.org/docs/reference/series.html)\n",
"\n",
"#### Indexing\n",
"By default, the dataframes we have created have an integer *index* to access data in a raw, but we can define any type of index. Let us create another dataframe with another type of index:"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 15,
"outputs": [
{
"data": {
"text/plain": " client_id postal_code\n2021-01-01 003 46021\n2021-01-02 005 46022\n2021-01-03 007 46022\n2021-02-04 003 46021\n2021-02-05 003 46027",
"text/html": "\n\n
\n \n \n | \n client_id | \n postal_code | \n
\n \n \n \n | 2021-01-01 | \n 003 | \n 46021 | \n
\n \n | 2021-01-02 | \n 005 | \n 46022 | \n
\n \n | 2021-01-03 | \n 007 | \n 46022 | \n
\n \n | 2021-02-04 | \n 003 | \n 46021 | \n
\n \n | 2021-02-05 | \n 003 | \n 46027 | \n
\n \n
\n
"
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"data_2 = {\"2021-01-01\": ['003', '46021'], \"2021-01-02\": ['005', '46022'], \"2021-01-03\": ['007', '46022'], \n",
" \"2021-02-04\": ['003', '46021'], \"2021-02-05\": ['003', '46027']}\n",
"\n",
"dataframe_3 = pd.DataFrame.from_dict(data_2, orient='index', columns=['client_id', 'postal_code'])\n",
"display(dataframe_3)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"We have used two new parameters to create our dataframe from a dictionary (from_dict), *orient*, and *columns*. When *orient* is index, the dictionary keys are used as the index of the created dataframe. In this case, we are using a string. The columns parameters allow us to give a name to the columns of our dataframe. \n",
"Now, we can access the data using the new index:"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 16,
"outputs": [
{
"name": "stdout",
"text": [
"003\n"
],
"output_type": "stream"
}
],
"source": [
"print(dataframe_3['client_id']['2021-01-01'])"
],
"metadata": {
"collapsed": false,
"pycharm": {
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"We can also set the index of an existing dataframe using its columns. Let us change the index of our first dataframe:"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 17,
"outputs": [
{
"data": {
"text/plain": " two three date\none \n1 5 Awesome 2021-01-01\n2 4 Brutal 2021-01-02\n3 3 Cool 2021-01-02\n4 2 Divine 2021-02-03\n5 1 Exquisite 2021-02-05",
"text/html": "\n\n
\n \n \n | \n two | \n three | \n date | \n
\n \n | one | \n | \n | \n | \n
\n \n \n \n | 1 | \n 5 | \n Awesome | \n 2021-01-01 | \n
\n \n | 2 | \n 4 | \n Brutal | \n 2021-01-02 | \n
\n \n | 3 | \n 3 | \n Cool | \n 2021-01-02 | \n
\n \n | 4 | \n 2 | \n Divine | \n 2021-02-03 | \n
\n \n | 5 | \n 1 | \n Exquisite | \n 2021-02-05 | \n
\n \n
\n
"
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"new_dataframe = dataframe.set_index(['one'])\n",
"display(new_dataframe)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"Notice, that since our column had a name, now our index also has a name. We can set the name of an index like:\n"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 18,
"outputs": [
{
"data": {
"text/plain": " client_id postal_code\ndate \n2021-01-01 003 46021\n2021-01-02 005 46022\n2021-01-03 007 46022\n2021-02-04 003 46021\n2021-02-05 003 46027",
"text/html": "\n\n
\n \n \n | \n client_id | \n postal_code | \n
\n \n | date | \n | \n | \n
\n \n \n \n | 2021-01-01 | \n 003 | \n 46021 | \n
\n \n | 2021-01-02 | \n 005 | \n 46022 | \n
\n \n | 2021-01-03 | \n 007 | \n 46022 | \n
\n \n | 2021-02-04 | \n 003 | \n 46021 | \n
\n \n | 2021-02-05 | \n 003 | \n 46027 | \n
\n \n
\n
"
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"dataframe_3.index.set_names(['date'], inplace=True)\n",
"display(dataframe_3)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"We can use different columns as index to create a dataframe similar to a **pivot table**:"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"source": [
"dataframe_4 = dataframe.set_index(['one', 'two', 'three'])\n",
"display(dataframe_4)\n",
"print(dataframe_4['date'][1][5]['Awesome'])"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% \n",
"is_executing": false
}
},
"execution_count": 19,
"outputs": [
{
"data": {
"text/plain": " date\none two three \n1 5 Awesome 2021-01-01\n2 4 Brutal 2021-01-02\n3 3 Cool 2021-01-02\n4 2 Divine 2021-02-03\n5 1 Exquisite 2021-02-05",
"text/html": "\n\n
\n \n \n | \n | \n | \n date | \n
\n \n | one | \n two | \n three | \n | \n
\n \n \n \n | 1 | \n 5 | \n Awesome | \n 2021-01-01 | \n
\n \n | 2 | \n 4 | \n Brutal | \n 2021-01-02 | \n
\n \n | 3 | \n 3 | \n Cool | \n 2021-01-02 | \n
\n \n | 4 | \n 2 | \n Divine | \n 2021-02-03 | \n
\n \n | 5 | \n 1 | \n Exquisite | \n 2021-02-05 | \n
\n \n
\n
"
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"text": [
"2021-01-01\n"
],
"output_type": "stream"
}
]
},
{
"cell_type": "markdown",
"source": [
"#### New columns\n",
"To add a new column, you just need to define the column name. For instance, let us to the original dataframe a new column containing the month:"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 20,
"outputs": [
{
"data": {
"text/plain": " one two three date month\n0 1 5 Awesome 2021-01-01 1\n1 2 4 Brutal 2021-01-02 1\n2 3 3 Cool 2021-01-02 1\n3 4 2 Divine 2021-02-03 2\n4 5 1 Exquisite 2021-02-05 2",
"text/html": "\n\n
\n \n \n | \n one | \n two | \n three | \n date | \n month | \n
\n \n \n \n | 0 | \n 1 | \n 5 | \n Awesome | \n 2021-01-01 | \n 1 | \n
\n \n | 1 | \n 2 | \n 4 | \n Brutal | \n 2021-01-02 | \n 1 | \n
\n \n | 2 | \n 3 | \n 3 | \n Cool | \n 2021-01-02 | \n 1 | \n
\n \n | 3 | \n 4 | \n 2 | \n Divine | \n 2021-02-03 | \n 2 | \n
\n \n | 4 | \n 5 | \n 1 | \n Exquisite | \n 2021-02-05 | \n 2 | \n
\n \n
\n
"
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"dataframe['month'] = pd.to_datetime(dataframe['date']).dt.month\n",
"display(dataframe)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"#### Grouping\n",
"Grouping allows us to group by values that have the same index value: "
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 21,
"outputs": [
{
"data": {
"text/plain": " one two\nmonth \n1 6 12\n2 9 3",
"text/html": "\n\n
\n \n \n | \n one | \n two | \n
\n \n | month | \n | \n | \n
\n \n \n \n | 1 | \n 6 | \n 12 | \n
\n \n | 2 | \n 9 | \n 3 | \n
\n \n
\n
"
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"ones_in_month = dataframe.set_index(['month']).groupby(['month']).sum()\n",
"display(ones_in_month)\n",
"\n"
],
"metadata": {
"collapsed": false,
"pycharm": {
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"Note that, first, we define the index month, and then we group by this index. Finally, we apply the vectorized function sum to sum all values of the columns with the same value of the index month. \n",
"Column three silently disappears since its type does not support this function. \n",
"\n",
"#### Joining\n",
"Indexes allow us to join different dataframes that have an index with the same name:"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 22,
"outputs": [
{
"data": {
"text/plain": " one two three month client_id postal_code\ndate \n2021-01-01 1 5 Awesome 1 003 46021\n2021-01-02 2 4 Brutal 1 005 46022\n2021-01-02 3 3 Cool 1 005 46022\n2021-02-05 5 1 Exquisite 2 003 46027",
"text/html": "\n\n
\n \n \n | \n one | \n two | \n three | \n month | \n client_id | \n postal_code | \n
\n \n | date | \n | \n | \n | \n | \n | \n | \n
\n \n \n \n | 2021-01-01 | \n 1 | \n 5 | \n Awesome | \n 1 | \n 003 | \n 46021 | \n
\n \n | 2021-01-02 | \n 2 | \n 4 | \n Brutal | \n 1 | \n 005 | \n 46022 | \n
\n \n | 2021-01-02 | \n 3 | \n 3 | \n Cool | \n 1 | \n 005 | \n 46022 | \n
\n \n | 2021-02-05 | \n 5 | \n 1 | \n Exquisite | \n 2 | \n 003 | \n 46027 | \n
\n \n
\n
"
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"dataframe_5 = dataframe.set_index(['date'])\n",
"dataframe_6 = dataframe_5.join(dataframe_3, how=\"inner\")\n",
"display(dataframe_6)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"is_executing": false
}
}
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
},
"pycharm": {
"stem_cell": {
"cell_type": "raw",
"source": [],
"metadata": {
"collapsed": false
}
}
}
},
"nbformat": 4,
"nbformat_minor": 0
}